Context

  • A novel coronavirus, known as 2019-nCoV, was first identified in Wuhan, the capital of China’s Hubei province.
  • Patients developed pneumonia with no apparent cause, and no effective vaccines or therapies were available.
  • The virus has been found to transmit from person to person.
  • In mid-January 2020, the transmission rate (rate of infection) appeared to be increasing.

COVID-19

The SARS-CoV-2 virus causes Coronavirus Disease (COVID-19), an infectious disease. Most people infected with the virus will have mild to moderate symptoms and will recover without any additional therapy. Some, however, will become seriously ill and require medical assistance.

The virus can spread from an infected person’s mouth or nose in small liquid particles when they cough, sneeze, speak, sing or breathe. These particles range from larger respiratory droplets to smaller aerosols. You can be infected by breathing in the virus if you are near someone who has COVID-19, or by touching a contaminated surface and then your eyes, nose or mouth. The virus spreads more easily indoors and in crowded settings.

Objective

  • The COVID-19 threat is growing increasingly serious.
  • Data-driven tools and techniques can help us counter this threat.
  • Data is a valuable resource that could aid in the development of effective defences against a pandemic of this magnitude.
  • Leveraging domain-specific data will also aid in the development of vaccines and other medical treatments.
  • Many analyses and studies have been undertaken on existing COVID-19 data, but little has been explored about its implications, distribution, and influence across geographies.
  • We attempt to evaluate and investigate the effects of the pandemic in order to gain insights that will help us defend against pandemics now and in the future.

Preliminary analysis

This report analyses the Novel Coronavirus (COVID-19) around the world and demonstrates data processing, visualization, insights and prediction.

We are given three main datasets.

  1. time_series_covid19_confirmed_global.csv
  2. time_series_covid19_deaths_global.csv
  3. time_series_covid19_recovered_global.csv

As a first step, let’s look at each of them. The first table contains the country name, latitude and longitude, followed by the number of confirmed cases as time progresses. The second and third datasets have the same layout and hold the death counts and recovery counts over time.

Table 1

library(tidyverse)  # dplyr, tidyr and ggplot2 are used throughout this report

raw.data.confirmed <- read.csv('time_series_covid19_confirmed_global.csv')
head(raw.data.confirmed, n=5L)

Table 2

raw.data.deaths <- read.csv('time_series_covid19_deaths_global.csv')
head(raw.data.deaths, n=5L)

Table 3

raw.data.recovered <- read.csv('time_series_covid19_recovered_global.csv')
head(raw.data.recovered, n=5L)

Exploring the rows and columns

The last data frame differs from the other two. The first two data frames each consist of 284 observations of 826 variables, while the third data frame consists of 269 observations of 826 variables. This suggests that recovery information is not available for a few instances. The Province.State column is also empty for the rows shown here, so we can consider dropping that column in further analysis.

Table 1

str(raw.data.confirmed,list.len=10)
## 'data.frame':    284 obs. of  826 variables:
##  $ Province.State: chr  "" "" "" "" ...
##  $ Country.Region: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Lat           : num  33.9 41.2 28 42.5 -11.2 ...
##  $ Long          : num  67.71 20.17 1.66 1.52 17.87 ...
##  $ X1.22.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.23.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.24.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.25.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.26.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.27.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

Table 2

str(raw.data.deaths,list.len=10)
## 'data.frame':    284 obs. of  826 variables:
##  $ Province.State: chr  "" "" "" "" ...
##  $ Country.Region: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Lat           : num  33.9 41.2 28 42.5 -11.2 ...
##  $ Long          : num  67.71 20.17 1.66 1.52 17.87 ...
##  $ X1.22.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.23.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.24.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.25.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.26.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.27.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

Table 3

str(raw.data.recovered, list.len=10)
## 'data.frame':    269 obs. of  826 variables:
##  $ Province.State: chr  "" "" "" "" ...
##  $ Country.Region: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Lat           : num  33.9 41.2 28 42.5 -11.2 ...
##  $ Long          : num  67.71 20.17 1.66 1.52 17.87 ...
##  $ X1.22.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.23.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.24.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.25.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.26.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1.27.20      : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
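To find out which rows cause the difference in row counts, we can compare the key columns of two tables. The snippet below is a minimal sketch on made-up toy data (the country and province values are illustrative, not taken from the real files). It also checks whether a column is empty everywhere before we decide to drop it.

```r
# Toy stand-ins for the confirmed and recovered tables (values are made up).
confirmed <- data.frame(
  Province.State = c("", "Alberta", "Ontario", ""),
  Country.Region = c("Albania", "Canada", "Canada", "Chile"),
  stringsAsFactors = FALSE
)
recovered <- data.frame(
  Province.State = c("", "", ""),
  Country.Region = c("Albania", "Canada", "Chile"),
  stringsAsFactors = FALSE
)

# Build a composite key so that province-level rows are compared correctly.
key_confirmed <- paste(confirmed$Country.Region, confirmed$Province.State, sep = "|")
key_recovered <- paste(recovered$Country.Region, recovered$Province.State, sep = "|")

# Rows present in the confirmed table but absent from the recovered table.
missing_in_recovered <- confirmed[!(key_confirmed %in% key_recovered), ]
print(missing_in_recovered)

# Columns that contain only empty strings (safe candidates for dropping).
empty_cols <- names(confirmed)[sapply(confirmed, function(col) all(col == ""))]
print(empty_cols)
```

Running the same check on the real data frames would show whether Province.State is empty in every row or only in the first few rows that str() prints.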

Reshaping the data

As we can see, each date is a column and each row holds the counts of confirmed cases, deaths or recoveries for one location. For proper analysis, let’s reshape the data into a longer format with one row per location and date.

Table 1

raw.data.confirmed <- raw.data.confirmed %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "confirmed_count")
raw.data.confirmed

Table 2

raw.data.deaths <- raw.data.deaths %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "death_count")
raw.data.deaths

Table 3

raw.data.recovered <- raw.data.recovered %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "recovered_count")
raw.data.recovered
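To make the effect of the reshaping step easier to see, here is a small sketch on a made-up two-row table that follows the same column naming as the real files. It assumes only that the tidyr package is installed.

```r
library(tidyr)

# Toy wide table: one row per country, one column per date (values are made up).
wide <- data.frame(
  Country.Region = c("A", "B"),
  X1.22.20 = c(0L, 1L),
  X1.23.20 = c(2L, 3L),
  stringsAsFactors = FALSE
)

# Every column whose name starts with "X" becomes a (Date, confirmed_count) pair.
long <- pivot_longer(wide,
                     cols = starts_with("X"),
                     names_to = "Date",
                     values_to = "confirmed_count")
print(long)
```

Two countries with two date columns each give four rows in the long table, one per country and date.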

Correcting the date String

Now each row shows a date and the corresponding count (confirmed, death or recovered) on that date. Note that the date strings still carry the leading "X" that read.csv added to the column names, so we strip it.

Table 1

raw.data.confirmed$Date <- substr(raw.data.confirmed$Date,2,20)
raw.data.confirmed

Table 2

raw.data.deaths$Date <- substr(raw.data.deaths$Date,2,20)
raw.data.deaths

Table 3

raw.data.recovered$Date <- substr(raw.data.recovered$Date,2,20)
raw.data.recovered
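Once the prefix is removed, the strings can be turned into real Date values. The sketch below uses base R only and two example column names in the same style as the data. The format string is "%m.%d.%y" because the year has two digits.

```r
# Example column names in the style produced by read.csv.
raw_names <- c("X1.22.20", "X12.31.21")

# Remove the leading "X", then parse month.day.two-digit-year.
date_strings <- sub("^X", "", raw_names)
dates <- as.Date(date_strings, format = "%m.%d.%y")
print(dates)

# Plain strings sort alphabetically, which is not chronological order.
as_text <- sort(c("2.1.20", "10.1.20"))
as_date <- sort(as.Date(c("2.1.20", "10.1.20"), format = "%m.%d.%y"))
print(as_text)
print(as_date)
```

Working with Date values rather than strings keeps ordering and filtering by time reliable.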

Merging for easy analysis

Now let’s merge the three data frames for easy comparison of the data.

# Join the three long tables on their shared key columns (merge() does an inner join by default).
key_cols <- c("Province.State","Country.Region","Lat","Long","Date")
data <- merge(x = raw.data.confirmed, y = raw.data.deaths, by = key_cols)
data <- merge(x = data, y = raw.data.recovered, by = key_cols)
# Dates look like "1.22.20", so the two-digit year needs %y rather than %Y.
data <- data[order(as.Date(data$Date, format="%m.%d.%y")),]
data

Here we merged the three data frames and ordered the result by date.
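Because merge() keeps only rows whose keys appear in both inputs, locations that are missing from one table silently disappear from the result. The following sketch on made-up data shows the difference between the default behaviour and a left join with all.x = TRUE.

```r
# Toy long tables (values are made up); country "B" has no recovered row.
confirmed <- data.frame(Country.Region = c("A", "B", "C"),
                        Date = "1.22.20",
                        confirmed_count = c(5L, 7L, 9L),
                        stringsAsFactors = FALSE)
recovered <- data.frame(Country.Region = c("A", "C"),
                        Date = "1.22.20",
                        recovered_count = c(1L, 2L),
                        stringsAsFactors = FALSE)

# Default merge: inner join, so "B" is dropped.
inner <- merge(confirmed, recovered, by = c("Country.Region", "Date"))

# all.x = TRUE keeps every confirmed row and fills the gap with NA.
left <- merge(confirmed, recovered, by = c("Country.Region", "Date"), all.x = TRUE)

print(inner)
print(left)
```

Which variant is appropriate depends on whether rows without recovery data should stay in the analysis.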

Exploring the missingness in data

Analysing missingness is another important aspect of data analysis. Even though we do not impute missing data in this analysis, let’s examine how much missing data is present.

library(naniar)  # provides miss_var_summary()
data[data==""]<-NA  # treat empty strings as missing values
miss_var_summary(data)
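If the naniar package is not available, a similar per-column summary can be computed with base R. This is a minimal sketch on a made-up data frame, not a replacement for the full summary above.

```r
# Toy data frame with empty strings standing in for missing values (made up).
df <- data.frame(Province.State = c("", "Hubei", ""),
                 Country.Region = c("A", "China", "B"),
                 death_count = c(0L, 1L, 3L),
                 stringsAsFactors = FALSE)

# Convert empty strings to NA, as in the step above.
df[df == ""] <- NA

# Number and percentage of missing values per column.
n_miss <- colSums(is.na(df))
pct_miss <- round(100 * colMeans(is.na(df)), 1)
print(data.frame(variable = names(n_miss), n_miss = n_miss, pct_miss = pct_miss))
```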

Data into different time frames

data <- separate(data, Date, into=c("month", "day", "year"), sep="\\.", remove = FALSE)
data

Detailed analysis

From here onwards we analyse the data. The analysis is organised by perspective, meaning that we look at the data from different points of view:

  1. Analysis from the perspective of the world
  2. Analysis specific to certain countries

Data

First, let’s store our data in a world_data variable so that we can manipulate it without disturbing our base data.

world_data <- data
world_data_by_countries <- world_data %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              Lat=max(Lat),
                              Long=max(Long))
world_data_by_countries

Here you can see a severe discrepancy between the confirmed cases and the corresponding death and recovered counts. This happens because, after a certain time, the data on deaths and recoveries is no longer made available in the database.
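A second thing to keep in mind with this aggregation is that max() per country returns the largest single row. If a country is reported in several province-level rows (an assumption about the merged data, not something verified here), the result is the largest province rather than the national total. The sketch below, on made-up data, contrasts the two approaches using base R.

```r
# Toy long table: one country reported as two provinces (values are made up).
long <- data.frame(
  Province.State = c("P1", "P1", "P2", "P2"),
  Country.Region = "X",
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02")),
  confirmed_count = c(10L, 15L, 4L, 6L),
  stringsAsFactors = FALSE
)

# Option 1: largest single row per country (what max() returns).
country_max <- aggregate(confirmed_count ~ Country.Region, data = long, FUN = max)

# Option 2: sum all provinces on the latest date to get the country total.
latest <- long[long$Date == max(long$Date), ]
country_total <- aggregate(confirmed_count ~ Country.Region, data = latest, FUN = sum)

print(country_max)    # 15: the largest province only
print(country_total)  # 21: both provinces added together
```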

COVID-19 Stats

Now let’s explore some quick stats about the novel coronavirus 2019.

  • Total Confirmed: 499255147
  • Total Recovered: 135566620
  • Total Death: 6157247

Plot - Top Affected (Stat)

Here are the top 20 countries with the highest number of confirmed cases.

top_20_countries_c <- top_n(world_data_by_countries, 20, confirmed)
confirmed_plot <- ggplot(top_20_countries_c, aes(x=Country.Region, y=confirmed)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=confirmed)) + coord_flip() + scale_fill_continuous(type = "viridis") + scale_y_log10() + labs(x="\nCountry", y="Confirmed cases\n") + theme_bw() +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
confirmed_plot

top_20_countries_r <- top_n(world_data_by_countries, 20, recovered)
recovered_plot <- ggplot(top_20_countries_r, aes(x=Country.Region, y=recovered)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=recovered)) + coord_flip() + scale_fill_continuous(type = "viridis") + scale_y_log10() + labs(x="\nCountry", y="Recovered cases\n") + theme_bw() +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
recovered_plot

Here are the top 20 countries with the highest number of deaths.

top_20_countries_d <- top_n(world_data_by_countries, 20, death)
death_plot <- ggplot(top_20_countries_d, aes(x=Country.Region, y=death)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=death)) + coord_flip() + scale_fill_continuous(type = "viridis") + scale_y_log10() + labs(x="\nCountry", y="Death count\n") + theme_bw() +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
death_plot

Plot - Least Affected Country/region

Now, let’s see the countries/regions least affected by COVID-19.

least_20_countries_c <- top_n(world_data_by_countries, -20, confirmed)
least_c <- ggplot(least_20_countries_c, aes(x=Country.Region, y=confirmed)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=confirmed)) + coord_flip()
least_c

least_20_countries_r <- top_n(world_data_by_countries, -20, recovered)
least_r <- ggplot(least_20_countries_r, aes(x=Country.Region, y=recovered)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=recovered)) + coord_flip()
least_r

least_20_countries_d <- top_n(world_data_by_countries, -20, death)
least_d <- ggplot(least_20_countries_d, aes(x=Country.Region, y=death)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=death)) + coord_flip()
least_d
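In recent dplyr versions, top_n() is documented as superseded by slice_max() and slice_min(). As a sketch, the same top and bottom selections could be written as follows on a made-up table. The column names mirror the ones used above, but the values are invented.

```r
library(dplyr)

# Toy per-country summary (values are made up).
toy <- data.frame(Country.Region = c("A", "B", "C", "D"),
                  confirmed = c(50, 10, 30, 20),
                  stringsAsFactors = FALSE)

# Two countries with the highest confirmed counts, largest first.
top_2 <- slice_max(toy, order_by = confirmed, n = 2)

# Two countries with the lowest confirmed counts, smallest first.
least_2 <- slice_min(toy, order_by = confirmed, n = 2)

print(top_2)
print(least_2)
```

Unlike top_n(), these functions return the rows already sorted, which is convenient for ranked bar charts.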

Plot - Top Affected (Map)

library(leaflet)  # needed for the map below
top_20_countries_d
leaflet(options=leafletOptions(dragging=FALSE, minzoom=18, maxzoom=18, nowrap=TRUE)) %>% addProviderTiles("CartoDB", group="CartoBD") %>%
addCircleMarkers(data = top_20_countries_c, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Top 20 confirmed") %>%
addCircleMarkers(data = top_20_countries_r, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Top 20 recovered") %>% 
  addCircleMarkers(data = top_20_countries_d, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Top 20 death") %>% 

addCircleMarkers(data = least_20_countries_c, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Least 20 confirmed") %>% 
addCircleMarkers(data= least_20_countries_r, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Least 20 recovered") %>% 
addCircleMarkers(data= least_20_countries_d, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Least 20 deaths") %>% 
addLayersControl(baseGroups = c("Top 20 confirmed","Top 20 recovered","Top 20 death","Least 20 confirmed", "Least 20 recovered", "Least 20 deaths"), options = layersControlOptions(collapsed = FALSE))

How it all began

Now, let’s analyze how it all began and how the virus progressed.

First half and Second Half 2020

First, let’s filter the COVID-19 data for the first six months of 2020.

world_data <- data
world_data <- transform(world_data, month = as.numeric(month), 
                    year = as.numeric(year), day=as.numeric(day)) %>%
  # %>% binds tighter than *, so wrap the product explicitly to get a plain vector
  mutate_at(c("confirmed_count"), ~as.vector(scale(.) * 10))
first_months <- world_data %>% filter(year==20)  %>% filter(month==1) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 

second_months <- world_data %>% filter(year==20)  %>% filter(month==2) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 
third_months <- world_data %>% filter(year==20)  %>% filter(month==3) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))
forth_months <- world_data %>% filter(year==20)  %>% filter(month==4) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 
fifth_months <- world_data %>% filter(year==20)  %>% filter(month==5) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 
sixth_months <- world_data %>% filter(year==20)  %>% filter(month==6) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))
# Months 7-12 form the second half of the year.
second_6_months <- world_data %>% filter(year==20)  %>% filter(month > 6) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))
world_data
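The monthly blocks above differ only in the month they filter on, so they could be produced by one small function. The sketch below uses a hypothetical helper named summarise_month (the name and its arguments are illustrative, not part of the original analysis) and a made-up data frame with the same column names.

```r
library(dplyr)

# Hypothetical helper: per-country summary for one year and month.
summarise_month <- function(df, yr, mo) {
  df %>%
    filter(year == yr, month == mo) %>%
    group_by(Country.Region) %>%
    summarise(confirmed = max(confirmed_count),
              death = max(death_count),
              recovered = max(recovered_count),
              lat = mean(Lat),
              long = mean(Long),
              .groups = "drop")
}

# Toy data with the same columns as world_data (values are made up).
toy <- data.frame(
  Country.Region = c("A", "A", "B", "A"),
  year = c(20, 20, 20, 20),
  month = c(1, 1, 1, 2),
  confirmed_count = c(1, 3, 2, 9),
  death_count = c(0, 1, 0, 2),
  recovered_count = c(0, 0, 1, 1),
  Lat = c(10, 10, 20, 10),
  Long = c(30, 30, 40, 30),
  stringsAsFactors = FALSE
)

# One summary per month, collected in a named list.
monthly <- lapply(1:2, function(mo) summarise_month(toy, 20, mo))
names(monthly) <- c("first_month", "second_month")
print(monthly$first_month)
```

Keeping the summaries in a list also makes it easy to loop over them when adding map layers.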

Plot 2020 Growth

Now let’s plot the month-by-month growth during 2020 on a map.

library(leaflet)
pal = colorNumeric(
  palette = "viridis",
  domain = world_data$confirmed_count  # full column name; $confirmed only resolved through partial matching
)

leaflet() %>% 
  addProviderTiles("CartoDB", group="CartoBD",options=providerTileOptions(noWrap=TRUE)) %>%
addCircleMarkers(data = first_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(first_months$confirmed), radius= ~confirmed*4, group="First month") %>% 
addCircleMarkers(data = second_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_months$confirmed), radius= ~confirmed*4, group="Second month") %>% 
  addCircleMarkers(data = third_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(third_months$confirmed), radius= ~confirmed*4, group="Third month") %>% 
addCircleMarkers(data = forth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(forth_months$confirmed), radius= ~confirmed*4, group="Forth month") %>% 
addCircleMarkers(data= fifth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(fifth_months$confirmed), radius= ~confirmed*4, group="Fifth month") %>% 
addCircleMarkers(data= sixth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(sixth_months$confirmed), radius= ~confirmed*4, group="Sixth month") %>% 
addCircleMarkers(data= second_6_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_6_months$confirmed), radius= ~confirmed*1, group="last 6 months") %>% 
addLayersControl(baseGroups = c("First month","Second month","Third month","Forth month", "Fifth month", "Sixth month","last 6 months"), options = layersControlOptions(collapsed = FALSE))

Plot 2021 Growth

# world_data <- data
# geo_code_merger <- select(geo_code, country, code) %>% group_by(country) %>% summarise(code=max(code))
# world_data_v2 <- merge(x = world_data , y = geo_code_merger, by.x = c("Country.Region"), by.y = ("country"))
# world_data_v2 <- world_data_v2[order(as.Date(world_data_v2$Date, format="%m.%d.%Y")),]
# df <- read.csv("graph.csv")

#p <- plot_geo(geo_code, locationmode = 'world') %>%
#add_trace( z = geo_code$new_cases_per_million, locations = geo_code$code, frame=geo_code$start_of_week,
#color = geo_code$new_cases_per_million) %>% colorbar(title = "Timeline") 
# p

#export as html file
# htmlwidgets::saveWidget(p, file = "map.html")



world_data <- data
first_months <- world_data %>% filter(year==21)  %>% filter(month==1) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 
second_months <- world_data %>% filter(year==21)  %>% filter(month==2) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 

third_months <- world_data %>% filter(year==21)  %>% filter(month==3) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))

forth_months <- world_data %>% filter(year==21)  %>% filter(month==4) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))

fifth_months <- world_data %>% filter(year==21)  %>% filter(month==5) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long)) 

sixth_months <- world_data %>% filter(year==21)  %>% filter(month==6) %>% group_by(Country.Region)  %>%
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))
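# Months 7-12 form the second half of the year; month is still a character column here.
second_6_months <- world_data %>% filter(year==21)  %>% filter(as.integer(month) > 6) %>% group_by(Country.Region)  %>%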
                    summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count),
                              lat = mean(Lat),
                              long = mean(Long))
pal = colorNumeric(
  palette = "viridis",
  domain = world_data$confirmed_count  # full column name; $confirmed only resolved through partial matching
)

leaflet() %>% 
  addProviderTiles("CartoDB", group="CartoBD",options=providerTileOptions(noWrap=TRUE)) %>%
addCircleMarkers(data = first_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(first_months$confirmed), radius= ~confirmed*0.000002, group="First month") %>% 
addCircleMarkers(data = second_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_months$confirmed), radius= ~confirmed*0.000002, group="Second month") %>% 

addCircleMarkers(data = third_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(third_months$confirmed), radius= ~confirmed*0.000002, group="Third month") %>% 
addCircleMarkers(data = forth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(forth_months$confirmed), radius= ~confirmed*0.000002, group="Forth month") %>% 
addCircleMarkers(data= fifth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(fifth_months$confirmed), radius= ~confirmed*0.000002, group="Fifth month") %>% 
addCircleMarkers(data= sixth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(sixth_months$confirmed), radius= ~confirmed*0.000002, group="Sixth month") %>% 
addCircleMarkers(data= second_6_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_6_months$confirmed), radius= ~confirmed*0.000002, group="last 6 months") %>% 
addLayersControl(baseGroups = c("First month","Second month","Third month","Forth month", "Fifth month", "Sixth month","last 6 months"), options = layersControlOptions(collapsed = FALSE))

Monthly analysis of COVID-19

Novel COVID 19 Stats


  • 16.21 % of the world’s coronavirus cases are from the US
  • 8.62 % of the world’s coronavirus cases are from India
  • 6.08 % of the world’s coronavirus cases are from Brazil
  • 5.49 % of the world’s coronavirus cases are from France
  • 4.84 % of the world’s coronavirus cases are from Germany
  • 4.39 % of the world’s coronavirus cases are from the United Kingdom
  • 3.58 % of the world’s coronavirus cases are from Russia
  • 3.37 % of the world’s coronavirus cases are from Korea, South
  • 3.21 % of the world’s coronavirus cases are from Italy
  • 3.01 % of the world’s coronavirus cases are from Turkey

world_data <- data %>% filter(year!=22)
world_data_by_countries <- world_data %>% group_by(Country.Region, month)  %>%
                              summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)

world_data_by_countries <- world_data_by_countries %>% group_by(month)  %>%
                              summarise(confirmed = sum(confirmed),
                              death = sum(death),
                              recovered = sum(recovered)) %>% arrange(month) %>% arrange(as.integer(month))

world_data_by_countries['confirmed_rev_cumsum'] <- c(world_data_by_countries$confirmed[1],diff(world_data_by_countries$confirmed))

world_data_by_countries['death_rev_cumsum'] <- c(world_data_by_countries$death[1],diff(world_data_by_countries$death))
world_data_by_countries['recovered_rev_cumsum'] <- c(world_data_by_countries$recovered[1],diff(world_data_by_countries$recovered))


template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white;background-image: linear-gradient(to left bottom, #a6e90d, #58d056, #00b374, #00937d, #2e7171);">
<h4>Novel COVID 19 Stats Monthly status 2020 & 2021</h4>
<hr style="border-top: 1px solid white;">
<b>According to data of 2020 and 2021, each months observed,</b>
'
template2 <- '
  <p>In the month of %s  we have observed `%0.2f` %% of total confirmed cases and `%0.2f` %% of total deaths</p>
'
mnths <- c("January", "February","March","April","May","June", "July", "August", "September", "October","November","December")
cat(template1)

Novel COVID 19 Stats Monthly status 2020 & 2021


According to data of 2020 and 2021, each months observed,

for (i in seq(nrow(world_data_by_countries))) {
  current <- world_data_by_countries[i, ]
  cat(sprintf(template2, mnths[as.integer(current$month)], (current$confirmed_rev_cumsum/sum(world_data_by_countries$confirmed_rev_cumsum))*100,(current$death_rev_cumsum/sum(world_data_by_countries$death_rev_cumsum))*100))
}

In the month of January we have observed 35.84 % of total confirmed cases and 42.13 % of total deaths

In the month of February we have observed 3.90 % of total confirmed cases and 5.72 % of total deaths

In the month of March we have observed 5.12 % of total confirmed cases and 5.52 % of total deaths

In the month of April we have observed 7.80 % of total confirmed cases and 6.98 % of total deaths

In the month of May we have observed 6.80 % of total confirmed cases and 7.00 % of total deaths

In the month of June we have observed 3.91 % of total confirmed cases and 5.23 % of total deaths

In the month of July we have observed 5.46 % of total confirmed cases and 4.98 % of total deaths

In the month of August we have observed 6.89 % of total confirmed cases and 5.53 % of total deaths

In the month of September we have observed 5.56 % of total confirmed cases and 4.85 % of total deaths

In the month of October we have observed 4.53 % of total confirmed cases and 3.98 % of total deaths

In the month of November we have observed 5.44 % of total confirmed cases and 3.98 % of total deaths

In the month of December we have observed 8.74 % of total confirmed cases and 4.10 % of total deaths
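The *_rev_cumsum columns are built by undoing a cumulative sum with diff(). The sketch below shows the idea on a made-up cumulative series. It also shows that the first element is a true monthly count only when the series starts from zero; otherwise a baseline (here a hypothetical previous-year total) has to be subtracted first.

```r
# Made-up cumulative totals at the end of each month.
cumulative <- c(Jan = 100, Feb = 250, Mar = 600)

# New cases per month: the first value stays, the rest are differences.
monthly_new <- c(cumulative[1], diff(cumulative))

# Each month's share of the total, in percent.
share <- 100 * monthly_new / sum(monthly_new)
print(round(share, 2))

# If the series continues from an earlier period, prepend that period's
# final total (a hypothetical baseline) so the first difference is correct.
baseline <- 40
monthly_new_rebased <- diff(c(baseline, cumulative))
print(monthly_new_rebased)
```

Without the baseline, the first month absorbs everything that happened before it, which can make that month look far larger than the others.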

world_data <- data %>% filter(year==20)
world_data_by_countries <- world_data %>% group_by(Country.Region, month)  %>%
                              summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)

world_data_by_countries <- world_data_by_countries %>% group_by(month)  %>%
                              summarise(confirmed = sum(confirmed),
                              death = sum(death),
                              recovered = sum(recovered)) %>% arrange(month) %>% arrange(as.integer(month))
world_data_by_countries['confirmed_rev_cumsum'] <- c(world_data_by_countries$confirmed[1],diff(world_data_by_countries$confirmed))

world_data_by_countries['death_rev_cumsum'] <- c(world_data_by_countries$death[1],diff(world_data_by_countries$death))
world_data_by_countries['recovered_rev_cumsum'] <- c(world_data_by_countries$recovered[1],diff(world_data_by_countries$recovered))


template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white; background-image: linear-gradient(to left bottom, #051937, #3c405e, #6e6c87, #a29cb3, #d7cfe1);">
<h4>Novel COVID 19 Stats Monthly status 2020</h4>
<hr style="border-top: 1px solid white;">
<b>According to data of 2020, each months observed,</b>
'
template2 <- '
  <p>In the month of %s  we have observed `%0.2f` %% of total confirmed cases and `%0.2f` %% of total deaths</p>
'
mnths <- c("January", "February","March","April","May","June", "July", "August", "September", "October","November","December")
cat(template1)

Novel COVID 19 Stats Monthly status 2020


According to data of 2020, each months observed,

world_data_by_countries_bkp <- world_data_by_countries
for (i in seq(nrow(world_data_by_countries))) {
  current <- world_data_by_countries[i, ]
  cat(sprintf(template2, mnths[as.integer(current$month)], (current$confirmed_rev_cumsum/sum(world_data_by_countries$confirmed_rev_cumsum))*100, (current$death_rev_cumsum/sum(world_data_by_countries$death_rev_cumsum))*100))
}

In the month of January we have observed 0.01 % of total confirmed cases and 0.01 % of total deaths

In the month of February we have observed 0.08 % of total confirmed cases and 0.14 % of total deaths

In the month of March we have observed 0.92 % of total confirmed cases and 2.22 % of total deaths

In the month of April we have observed 2.84 % of total confirmed cases and 10.32 % of total deaths

In the month of May we have observed 3.45 % of total confirmed cases and 7.96 % of total deaths

In the month of June we have observed 5.15 % of total confirmed cases and 7.67 % of total deaths

In the month of July we have observed 8.55 % of total confirmed cases and 9.44 % of total deaths

In the month of August we have observed 9.54 % of total confirmed cases and 9.83 % of total deaths

In the month of September we have observed 10.17 % of total confirmed cases and 9.00 % of total deaths

In the month of October we have observed 14.47 % of total confirmed cases and 9.83 % of total deaths

In the month of November we have observed 20.57 % of total confirmed cases and 14.71 % of total deaths

In the month of December we have observed 24.26 % of total confirmed cases and 18.85 % of total deaths

world_data <- data %>% filter(year==21)
world_data_by_countries <- world_data %>% group_by(Country.Region, month)  %>%
                              summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)

world_data_by_countries <- world_data_by_countries %>% group_by(month)  %>%
                              summarise(confirmed = sum(confirmed),
                              death = sum(death),
                              recovered = sum(recovered)) %>% arrange(month) %>% arrange(as.integer(month))

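# Rebase the confirmed counts so that 2021 starts from zero (subtract the 2020 total).
# Note: death and recovered are not rebased here, so the January death figure still includes the 2020 total.
world_data_by_countries['confirmed'] <- world_data_by_countries$confirmed - sum(world_data_by_countries_bkp$confirmed_rev_cumsum)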


world_data_by_countries['confirmed_rev_cumsum'] <- c(world_data_by_countries$confirmed[1],diff(world_data_by_countries$confirmed))

world_data_by_countries['death_rev_cumsum'] <- c(world_data_by_countries$death[1],diff(world_data_by_countries$death))
world_data_by_countries['recovered_rev_cumsum'] <- c(world_data_by_countries$recovered[1],diff(world_data_by_countries$recovered))


template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white; background-image: linear-gradient(to right top, #051937, #004d7a, #008793, #00bf72, #a8eb12);">
<h4>Novel COVID 19 Stats Monthly status 2021</h4>
<hr style="border-top: 1px solid white;">
<b>According to data of 2021, each months observed,</b>
'
template2 <- '
  <p>In the month of %s  we have observed `%0.2f` %% of total confirmed cases and `%0.2f` %% of total deaths</p>
'
mnths <- c("January", "February","March","April","May","June", "July", "August", "September", "October","November","December")
cat(template1)

Novel COVID 19 Stats Monthly status 2021


According to data of 2021, each months observed,

for (i in seq(nrow(world_data_by_countries))) {
  current <- world_data_by_countries[i, ]
  cat(sprintf(template2, mnths[as.integer(current$month)], (current$confirmed_rev_cumsum/sum(world_data_by_countries$confirmed_rev_cumsum))*100,(current$death_rev_cumsum/sum(world_data_by_countries$death_rev_cumsum))*100))
}

In the month of January we have observed 9.54 % of total confirmed cases and 42.13 % of total deaths

In the month of February we have observed 5.50 % of total confirmed cases and 5.72 % of total deaths

In the month of March we have observed 7.22 % of total confirmed cases and 5.52 % of total deaths

In the month of April we have observed 11.00 % of total confirmed cases and 6.98 % of total deaths

In the month of May we have observed 9.59 % of total confirmed cases and 7.00 % of total deaths

In the month of June we have observed 5.52 % of total confirmed cases and 5.23 % of total deaths

In the month of July we have observed 7.70 % of total confirmed cases and 4.98 % of total deaths

In the month of August we have observed 9.71 % of total confirmed cases and 5.53 % of total deaths

In the month of September we have observed 7.83 % of total confirmed cases and 4.85 % of total deaths

In the month of October we have observed 6.39 % of total confirmed cases and 3.98 % of total deaths

In the month of November we have observed 7.67 % of total confirmed cases and 3.98 % of total deaths

In the month of December we have observed 12.32 % of total confirmed cases and 4.10 % of total deaths

world_data <- data
world_data_by_countries <- world_data %>% group_by(Country.Region, month)  %>%
                              summarise(confirmed = max(confirmed_count),
                              death = max(death_count),
                              recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)
c1 <- world_data_by_countries %>% filter(month==c(10))
c2 <- world_data_by_countries %>% filter(month==c(12))
c2$confirmed <- c2$confirmed-c1$confirmed
c2$death <- c2$death-c1$death

c2 <- top_n(c2, 10, confirmed) %>% arrange(-confirmed)

template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white; background-image: linear-gradient(to right top, #051937, #004d7a, #008793, #00bf72, #a8eb12);">
<h4>Countries with the highest confirmed cases and deaths in the last quarter (Q4)</h4>
<hr style="border-top: 1px solid white;">
<b>According to the Q4 data, as shares of the top-10 totals:</b>
'
# Percentages are relative to the sums over the ten countries kept in c2
template2 <- '
  <p>In %s we observed `%0.2f` %% of confirmed cases and `%0.2f` %% of deaths</p>
'
cat(template1)

Countries with the highest confirmed cases and deaths in the last quarter (Q4)


According to the Q4 data, as shares of the top-10 totals:

for (i in seq_len(nrow(c2))) {
  current <- c2[i, ]
  cat(sprintf(template2, current$Country.Region,
              (current$confirmed / sum(c2$confirmed)) * 100,
              (current$death / sum(c2$death)) * 100))
}

In US we observed 33.68 % of confirmed cases and 36.45 % of deaths

In United Kingdom we observed 14.85 % of confirmed cases and 3.62 % of deaths

In France we observed 10.61 % of confirmed cases and 2.68 % of deaths

In Germany we observed 9.77 % of confirmed cases and 7.34 % of deaths

In Russia we observed 7.43 % of confirmed cases and 31.03 % of deaths

In Turkey we observed 5.55 % of confirmed cases and 5.32 % of deaths

In Italy we observed 5.18 % of confirmed cases and 2.40 % of deaths

In Spain we observed 4.91 % of confirmed cases and 0.92 % of deaths

In Poland we observed 4.15 % of confirmed cases and 9.09 % of deaths

In Netherlands we observed 3.86 % of confirmed cases and 1.14 % of deaths
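
The Q4 figures above come from subtracting `c1` from `c2` row by row, which only works if both data frames list the same countries in the same order. The following is a minimal sketch of a name-based alternative, using made-up numbers and hypothetical object names (`monthly`, `oct`, `dec`, `q4`); it is not the code used to produce the output above.

```r
# Toy cumulative maxima per country and month (values invented for illustration)
monthly <- data.frame(
  Country.Region = c("A", "B", "C", "B", "A"),
  month          = c(10, 10, 10, 12, 12),
  confirmed      = c(100, 50, 70, 90, 180),
  stringsAsFactors = FALSE
)

oct <- monthly[monthly$month == 10, c("Country.Region", "confirmed")]
dec <- monthly[monthly$month == 12, c("Country.Region", "confirmed")]

# merge() pairs rows by country name, so a different row order or a country
# missing from one month cannot silently misalign the subtraction
q4 <- merge(dec, oct, by = "Country.Region", suffixes = c("_dec", "_oct"))
q4$confirmed_increase <- q4$confirmed_dec - q4$confirmed_oct
q4 <- q4[order(-q4$confirmed_increase), ]
print(q4)
```

Country "C" has no December row in this toy data, so the default inner join drops it instead of shifting the remaining rows.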

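The reported counts are cumulative. Grouped by month across both years, consecutive months look like this:

Month   2020        2021
1       x           x + y + … + z
2       x + y       x + y + … + z + v
…       …           …

Total for month 1: x + (x + y + … + z)               (1)
Total for month 2: (x + y) + (x + y + … + z + v)     (2)

(2) - (1) = y + v, which is the sum of the individual (non-cumulative) values for that month.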
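
The same idea can be checked numerically. This is a small sketch with invented counts (the vector `monthly_cases` is hypothetical), showing that keeping the first cumulative value and differencing the rest recovers the per-month values, which is what the `c(x[1], diff(x))` pattern used earlier does.

```r
# Hypothetical per-month counts (not taken from the dataset)
monthly_cases <- c(5, 8, 3, 10)

# Cumulative totals, in the form the source data reports them
cumulative <- cumsum(monthly_cases)

# Reverse the cumulative sum: keep the first value, then take successive differences
increments <- c(cumulative[1], diff(cumulative))
print(increments)
```

Note that the first element is kept as-is, so if the series starts part-way through an outbreak, that first value still carries everything accumulated before it.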

Key Takeaways


  • January shows the highest number of cases, but it is only the start of the series, and the data from the preceding months are not properly represented.
  • After January (analysing the 2020 and 2021 data together), we can observe the rate of increase rising over the last quarter (October, November, December) and the beginning of January.
  • To gather more evidence for this hypothesis, we looked at 2020 and 2021 separately and observed a similar rate of increase over the last quarter of each year.
  • Over the last quarter (Q4), the highest case counts come from the US, UK, France and Germany. All of these countries have severe winters, so the data suggest that the coronavirus spreads more in winter. This is consistent with reports that rates of COVID might increase in winter, though not necessarily because the virus thrives in the cold.
  • Although India ranks among the top countries for COVID cases, it is in a comparatively better position in Q4.

Analysis by population

Now, let’s look at the most affected countries relative to their population. For this investigation we used an additional dataset containing each country's name and its population. Click here for dataset. Here we joined the COVID-19 data with the population data.

Ratio of total affected people vs population

top_20_countries <- top_20_countries_c

# library(readr)
# csvData <- read_csv("csvData.csv")
# csvData$pop2022 <- csvData$pop2022 *1000
# Align country names with those used in the COVID-19 data
csvData$country[csvData$country == "United States"] <- "US"
csvData$country[csvData$country == "South Korea"] <- "Korea, South"
top_20_countries <- merge(x =top_20_countries  , y = csvData, by.x = c("Country.Region"), by.y=c("country"))
top_20_countries <- top_20_countries[order(-top_20_countries$confirmed),]
top_20_countries['confirmed_to_pop_ratio'] <- top_20_countries$confirmed/top_20_countries$pop2022
top_20_countries['death_to_pop_ratio'] <- top_20_countries$death/top_20_countries$pop2022
top_20_countries <- top_20_countries[order(-top_20_countries$confirmed_to_pop_ratio),]
top_20_countries[c('Country.Region', 'confirmed', 'death', 'confirmed_to_pop_ratio')]

Ratio of total deaths vs population

top_20_countries <- top_20_countries[order(-top_20_countries$death_to_pop_ratio),]
top_20_countries[c('Country.Region', 'confirmed', 'death', 'death_to_pop_ratio')]

Key Takeaways


  • Analysing the data frames above and taking each country's total population into account, the Netherlands is the most affected.
  • Confirmed COVID-19 cases amount to 47 percent of the whole population in the Netherlands, followed by 41 percent in France.
  • The United States, which has the highest number of COVID cases, has 27 percent of its population affected.
  • Relative to population, COVID killed the most people in Brazil (0.3 percent of the population).
  • In India, COVID affected 3% of the total population, with deaths amounting to 0.03% of the population.
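
The join and the per-population ratios behind these takeaways can be sketched as follows. The numbers are invented and the data frames `cases` and `population` are hypothetical stand-ins for the COVID-19 and population datasets; only the column names follow the code above.

```r
# Toy stand-ins for the two datasets (all values invented)
cases <- data.frame(
  Country.Region = c("US", "Korea, South", "India"),
  confirmed      = c(300, 40, 500),
  death          = c(6, 1, 5),
  stringsAsFactors = FALSE
)
population <- data.frame(
  country = c("United States", "South Korea", "India"),
  pop2022 = c(1000, 200, 10000),
  stringsAsFactors = FALSE
)

# Country names must match exactly before joining
population$country[population$country == "United States"] <- "US"
population$country[population$country == "South Korea"] <- "Korea, South"

joined <- merge(cases, population, by.x = "Country.Region", by.y = "country")

# Express cases and deaths as a percentage of the population
joined$confirmed_pct <- 100 * joined$confirmed / joined$pop2022
joined$death_pct     <- 100 * joined$death / joined$pop2022
joined <- joined[order(-joined$confirmed_pct), ]
print(joined[, c("Country.Region", "confirmed_pct", "death_pct")])
```

Because `merge()` performs an inner join by default, any country whose name is spelled differently in the two sources is dropped without warning, which is why the renaming step matters.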

Analysis by country

Now let's select a few of the top countries and analyse the data in more depth. First, let’s consider the United States, the country that has shown a very high number of confirmed cases.

United States

library(ggplot2)

data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
US <- country %>% filter(Country.Region=="US")
country
ggplot(US, aes(x=days, y=confirmed_count)) + geom_line(color="red") +
  theme_classic() +
  labs(title = "Covid-19 United States Confirmed Cases", x= "Days", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

ggplot(US, aes(x=days, y=death_count)) + geom_line(color="red") +
  theme_classic() +
  labs(title = "Covid-19 United States Death Cases", x= "Days", y= "Death cases") +
  theme(plot.title = element_text(hjust = 0.5)) 
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

ggplot(US, aes(x=days, y=recovered_count)) + geom_line(color="red") +
  theme_classic() +
  labs(title = "Covid-19 United States Recovered Cases", x= "Days", y= "Recovered cases") +
  theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

drop <- c("Province.State")
country = country[,!(names(country) %in% drop)]

# Some inconsistency with UK data, hence ignoring
country <- country %>% filter(!Country.Region=="United Kingdom")

# Use %in% to test membership; == would recycle the vector element by element
country <- country %>% filter(Country.Region %in% top_20_countries$Country.Region)
world_perspective <- ggplot(country, aes(x=days, y=confirmed_count, group=Country.Region, color=Country.Region)) + geom_line() + labs(title = "Covid-19 Confirmed Cases in world perspective", x= "Days", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5))
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
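
The repeated "Don't know how to automatically pick scale for object of type difftime" message appears because subtracting two dates yields a `difftime` object. A minimal sketch of one way to avoid it, assuming the day counter only needs to be a plain number (the vector `dates` is invented for illustration):

```r
dates <- as.Date(c("2020-01-22", "2020-01-23", "2020-01-25"))

# Subtracting Dates gives a difftime, which is what triggers the ggplot2 message
days_difftime <- dates - dates[1] + 1
print(class(days_difftime))

# Converting to plain numbers gives an ordinary continuous axis
days_numeric <- as.numeric(dates - dates[1], units = "days") + 1
print(days_numeric)
```

In the pipelines above this would correspond to wrapping the subtraction as `as.numeric(Date - first(Date), units = "days") + 1` inside `mutate()`.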

Here we have data on how COVID affected the different states of the United States. For many countries, such data are not available.

us_data_confirmed <- read.csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv')
us_data_confirmed <- us_data_confirmed %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "confirmed_count")
us_data_confirmed$Date <- substr(us_data_confirmed$Date,2,20)
us_data_confirmed <- us_data_confirmed %>% group_by(Province_State) %>% summarise(confirmed=max(confirmed_count), Lat=median(Lat), Long_=median(Long_))

lng<-mean(us_data_confirmed$Long_)
lat<-mean(us_data_confirmed$Lat)
pal = colorNumeric(
  palette = "viridis",
  domain = us_data_confirmed$`confirmed`
)
leaflet(us_data_confirmed) %>% addTiles() %>%
  addCircleMarkers(lng = ~Long_, lat = ~Lat,
                    label = ~Province_State,
                    color=~pal(us_data_confirmed$confirmed),
                   radius= ~confirmed*0.000015)%>%
addLegend( "bottomright", pal = pal, values = ~confirmed,
              title = "Total Affected",
              labFormat = labelFormat(prefix = " "),
              opacity = 0.75)%>%
  setView(lat= 35, lng=-100,zoom=4)

Australia

data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
Australia <- country %>% filter(Country.Region=="Australia") %>% group_by(Date) %>% mutate(confirmed_count=sum(confirmed_count),
                                                                                    death_count=sum(death_count),
                                                                                    recovered_count=sum(recovered_count))

ggplot(Australia, aes(x=days, y=confirmed_count)) + geom_line(color="red") +
  theme_classic() +
  labs(title = "Covid-19 Australia Confirmed Cases", x= "Days", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

ggplot(Australia, aes(x=days, y=death_count)) + geom_line(color="red") +
  theme_classic() +
  labs(title = "Covid-19 Australia Death Cases", x= "Days", y= "Death cases") +
  theme(plot.title = element_text(hjust = 0.5)) 
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

ggplot(Australia, aes(x=days, y=recovered_count)) + geom_line(color="red") +
  theme_classic() +
  labs(title = "Covid-19 Australia Recovered Cases", x= "Days", y= "Recovered cases") +
  theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

# Some inconsistency with UK data, hence ignoring
country <- country %>% filter(!Country.Region=="United Kingdom")


# Use %in% to test membership; == would recycle the vector element by element
country <- country %>% filter(Country.Region %in% top_20_countries$Country.Region)
world_perspective <- ggplot(country, aes(x=days, y=confirmed_count, group=Country.Region, color=Country.Region)) + geom_line() + theme_classic() +
  labs(title = "Covid-19 Confirmed Cases in world perspective", x= "Days", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5))
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
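
When keeping only the rows that belong to a set of countries, the comparison operator matters. This small sketch (with invented vectors `countries` and `wanted`) contrasts `==`, which compares position by position and recycles the shorter vector, with `%in%`, which tests set membership:

```r
countries <- c("US", "India", "Brazil", "US", "India", "Brazil")
wanted    <- c("India", "US")

# `==` recycles `wanted` as India, US, India, US, ... and compares element by element
eq_match <- countries == wanted
print(eq_match)

# `%in%` asks, for each country, whether it appears anywhere in `wanted`
in_match <- countries %in% wanted
print(in_match)
```

With `==`, only rows that happen to line up with the recycled vector are kept, so most matching rows are lost; `%in%` keeps every row for the wanted countries.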

Here is the plot of how COVID affected the different states and territories of Australia.

data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)

us_data_confirmed <- country %>% filter(Country.Region=="Australia")
us_data_confirmed <- us_data_confirmed %>% group_by(Province.State) %>% summarise(confirmed=max(confirmed_count), Lat=median(Lat), Long_=median(Long))

lng<-mean(us_data_confirmed$Long_)
lat<-mean(us_data_confirmed$Lat)

pal = colorNumeric(
  palette = "viridis",
  domain = us_data_confirmed$`confirmed`
)

leaflet(us_data_confirmed) %>% addTiles() %>%
  addCircleMarkers(lng = ~Long_, lat = ~Lat,
                    label = ~Province.State,
                    color=~pal(us_data_confirmed$confirmed),
                   radius= ~confirmed*0.000025)%>%
addLegend( "bottomright", pal = pal, values = ~confirmed,
              title = "Total Affected",
              labFormat = labelFormat(prefix = " "),
              opacity = 0.75)%>%
  setView(lat= -30, lng=140,zoom=4)

Other Countries

data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
country <- country %>% filter(Country.Region %in% top_20_countries$Country.Region)
world_perspective <- ggplot(country, aes(x=days, y=confirmed_count, group=Country.Region, color=Country.Region)) + geom_line() +
  theme_classic() +
  labs(title = "Covid-19 Confirmed Cases in world perspective", x= "Days", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~Country.Region)
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

world_perspective <- ggplot(country, aes(x=days, y=death_count, group=Country.Region, color=Country.Region)) + geom_line() +
  theme_classic() +
  labs(title = "Covid-19 Death Cases in world perspective", x= "Days", y= "Death cases") +
  theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~Country.Region)
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

world_perspective <- ggplot(country, aes(x=days, y=recovered_count, group=Country.Region, color=Country.Region)) + geom_line() +
  theme_classic() +
  labs(title = "Covid-19 Recovered Cases in world perspective", x= "Days", y= "Recovered cases") +
  theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~Country.Region)
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

Where do we stand?

In this section we analyse the situation in India. Since the data required for this analysis are not present in the CSSEGISandData/COVID-19 repo, we use another dataset for this purpose.

Covid dataset

str(covid)
## 'data.frame':    18110 obs. of  9 variables:
##  $ Sno                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date                    : chr  "2020-01-30" "2020-01-31" "2020-02-01" "2020-02-02" ...
##  $ Time                    : chr  "6:00 PM" "6:00 PM" "6:00 PM" "6:00 PM" ...
##  $ State.UnionTerritory    : chr  "Kerala" "Kerala" "Kerala" "Kerala" ...
##  $ ConfirmedIndianNational : chr  "1" "1" "2" "3" ...
##  $ ConfirmedForeignNational: chr  "0" "0" "0" "0" ...
##  $ Cured                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Confirmed               : int  1 1 2 3 3 3 3 3 3 3 ...

Testing Dataset

str(testing)
## 'data.frame':    16336 obs. of  5 variables:
##  $ Date        : chr  "2020-04-17" "2020-04-24" "2020-04-27" "2020-05-01" ...
##  $ State       : chr  "Andaman and Nicobar Islands" "Andaman and Nicobar Islands" "Andaman and Nicobar Islands" "Andaman and Nicobar Islands" ...
##  $ TotalSamples: num  1403 2679 2848 3754 6677 ...
##  $ Negative    : int  1210 NA NA NA NA NA NA NA NA NA ...
##  $ Positive    : num  12 27 33 33 33 33 33 33 33 33 ...

Vaccine Dataset

str(vaccine)
## 'data.frame':    7644 obs. of  24 variables:
##  $ Updated.On                          : chr  "16/01/2021" "17/01/2021" "18/01/2021" "19/01/2021" ...
##  $ State                               : chr  "India" "India" "India" "India" ...
##  $ Total.Doses.Administered            : num  48276 58604 99449 195525 251280 ...
##  $ Sessions                            : num  3455 8532 13611 17855 25472 ...
##  $ Sites                               : num  2957 4954 6583 7951 10504 ...
##  $ First.Dose.Administered             : num  48276 58604 99449 195525 251280 ...
##  $ Second.Dose.Administered            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Male..Doses.Administered.           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Female..Doses.Administered.         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Transgender..Doses.Administered.    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Covaxin..Doses.Administered.        : num  579 635 1299 3017 3946 ...
##  $ CoviShield..Doses.Administered.     : num  47697 57969 98150 192508 247334 ...
##  $ Sputnik.V..Doses.Administered.      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ AEFI                                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X18.44.Years..Doses.Administered.   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X45.60.Years..Doses.Administered.   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X60..Years..Doses.Administered.     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X18.44.Years.Individuals.Vaccinated.: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X45.60.Years.Individuals.Vaccinated.: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X60..Years.Individuals.Vaccinated.  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Male.Individuals.Vaccinated.        : num  23757 27348 41361 81901 98111 ...
##  $ Female.Individuals.Vaccinated.      : num  24517 31252 58083 113613 153145 ...
##  $ Transgender.Individuals.Vaccinated. : num  2 4 5 11 24 38 80 103 128 201 ...
##  $ Total.Individuals.Vaccinated        : num  48276 58604 99449 195525 251280 ...

Detailed Analysis

In this section we analyse the Indian data from different aspects.

full_covid_data <- inner_join(covid,testing, by=c("Date"="Date","State.UnionTerritory"="State"))
full_covid_data[is.na(full_covid_data)] <- 0
top_affected <- full_covid_data %>% group_by(State.UnionTerritory) %>% summarise(Cured=max(Cured), Deaths=max(Deaths), Confirmed=max(Confirmed)) %>%
  select(State.UnionTerritory,Cured,Deaths,Confirmed) %>%
  arrange(desc(Confirmed)) %>% top_n(10)
## Selecting by Confirmed
ta <- as.vector(top_affected[['State.UnionTerritory']])

Top Affected States - Growth

full_covid_data$Date <- as.Date(full_covid_data$Date)

full_covid_data %>% 
  filter(State.UnionTerritory %in% ta) %>% 
  ggplot(aes(x=Date,y=Confirmed)) + geom_line(aes(color=State.UnionTerritory),size=1.2)+
  scale_x_date(limit=c(as.Date("2020-04-01"),as.Date("2021-08-11"))) +
  theme_classic() +
  scale_y_continuous(labels=scales :: number_format(accuracy=1))+
  labs(title='Time Series for Confirmed Cases',subtitle = 'Top affected states')+
  xlab(label='Time Period') +
  ylab(label='Confirmed Cases') +
  scale_fill_viridis_d()

full_covid_data$Active = (full_covid_data$Confirmed-(full_covid_data$Deaths + full_covid_data$Cured))

full_covid_data %>% 
  filter(State.UnionTerritory %in% ta) %>% 
  ggplot(aes(x=Date,y=Active)) + geom_line(aes(color=State.UnionTerritory),size=1.2)+
  scale_x_date(limit=c(as.Date("2020-04-01"),as.Date("2021-05-07"))) +
  scale_y_continuous(labels=scales :: number_format(accuracy=1))+
  labs(title='Time Series for Active Cases',subtitle = 'Top 10 worst affected states')+
  theme_classic() +
  xlab(label='Time Period') +
  ylab(label='Active Cases') +
  scale_fill_viridis_d()

Top Affected States - Stat

full_covid_data %>% 
  filter(Date==max(Date)) %>% 
  ggplot(aes(x=Confirmed,y=State.UnionTerritory))+geom_col(fill='red',alpha=0.8)+
  scale_x_continuous(labels=scales :: number_format(accuracy=1))+
  theme_minimal() +
  labs(title="Total Confirmed cases grouped by states")

full_covid_data %>% 
  filter(Date==max(Date)) %>% 
  ggplot(aes(x=Active,y=State.UnionTerritory))+geom_col(fill='green',alpha=0.8)+
  scale_x_continuous(labels=scales :: number_format(accuracy=1))+
  theme_light() +
  labs(title="Total Active cases grouped by states")

full_covid_data %>% 
  filter(Date==max(Date)) %>% 
  ggplot(aes(x=Deaths,y=State.UnionTerritory))+geom_col()+
  scale_fill_viridis_d() +
  scale_x_continuous(labels=scales :: number_format(accuracy=1))+
  theme_light() +
  labs(title="Total Deaths grouped by states") 

Indian Perspective

Here is the detailed plot of the growth of COVID in India.

india<-full_covid_data %>% 
  group_by(Date) %>% 
  summarise(Cured_tot=sum(Cured),
            Deaths_tot=sum(Deaths),
            Confirmed_tot=sum(Confirmed),
            Active_tot=sum(Active))
plot_india <- india %>% 
  ggplot(aes(x=Date,y=Confirmed_tot)) + geom_line(color='blue',size=1) +
  labs(title="Time series for Confirmed Cases")+
  theme_linedraw() +
  xlab(label ="Time Period") +
  ylab(label="Confirmed Cases") +
  scale_y_continuous(labels = scales :: number_format(accuracy=1))
plot_india

Case Fatality/Cured Rate

library(lubridate)
# transmute() in dplyr adds new variables, especially computed ones. Unlike mutate(),
# transmute() removes the other columns by default. Creating new columns from
# computations on existing columns is a common data wrangling task.
tbl_covid_19_india <- covid
colnames(tbl_covid_19_india) <- sub("/", "", colnames(tbl_covid_19_india), fixed = TRUE)
tbl_covid_19_india  <- tbl_covid_19_india %>% mutate(new_date = ymd(Date)) %>%
    transmute(
        Sno = Sno, 
        Date = new_date, 
        StateUnionTerritory = State.UnionTerritory, 
        ConfirmedIndianNational = ConfirmedIndianNational, 
        ConfirmedForeignNational = ConfirmedForeignNational, 
        Cured = Cured, 
        Deaths = Deaths, 
        Confirmed = Confirmed
        )
# tbl_covid_19_india
tbl_deaths_percentage_1 <- inner_join(tbl_covid_19_india, 
                                      tbl_covid_19_india %>% group_by(StateUnionTerritory) %>%
                                        summarise(max_date = max(Date)) %>% ungroup() %>% 
                                        transmute(StateUnionTerritory = StateUnionTerritory, 
                                                  Date = max_date), by = c("StateUnionTerritory", "Date")) 

# tbl_deaths_percentage_1
tbl_deaths_percentage <- mutate(tbl_deaths_percentage_1, 
                                new_StateUnionTerritory = str_replace(StateUnionTerritory, "#", ""),
                                new_StateUnionTerritory1 = str_replace(new_StateUnionTerritory, "Andaman and Nicobar Islands", "Andaman & Nicobar")) %>%
  transmute(state = new_StateUnionTerritory1, 
            Date = Date, 
            Cured = Cured, 
            Deaths = Deaths, 
            Confirmed = Confirmed)

# tbl_deaths_percentage
# COVID 19 India - Case Fatality Rate - % of Deaths/Confirmed Cases
p_death <-  tbl_deaths_percentage %>% group_by(state) %>%
  summarise(sum_cured = sum(Cured), 
            sum_deaths = sum(Deaths), 
            sum_confirmed = sum(Confirmed), 
            deaths_perc = round(sum(Deaths)/sum(Confirmed)*100, digits = 2)) %>%
            filter(deaths_perc != 0) %>%
            ggplot(mapping = aes(x = reorder(state, deaths_perc), y = deaths_perc)) + 
            geom_bar(mapping = aes(fill = state), stat = "identity", show.legend = FALSE) + 
            coord_flip() + 
            xlab("States/Union Territories") +
            ylab("% of Deaths/Confirmed") + 
            ggtitle("Case Fatality Rate - % of Deaths/Confirmed Cases") +
            scale_fill_viridis_d() + theme_minimal()

p_death

# COVID 19 India - % of Cured/Confirmed Cases
p_cured <- tbl_deaths_percentage %>% group_by(state) %>%
  summarise(sum_cured = sum(Cured), 
            sum_deaths = sum(Deaths), 
            sum_confirmed = sum(Confirmed), 
            cured_perc = round(sum(Cured)/sum(Confirmed)*100, digits = 2)) %>%
  filter(cured_perc != 0) %>% mutate(rown = row_number(desc(cured_perc))) %>% filter(rown <= 25) %>%
  ggplot(mapping = aes(x = reorder(state, cured_perc), y = cured_perc)) + 
  geom_bar(mapping = aes(fill = state), stat = "identity", show.legend = FALSE) + 
  coord_flip() + 
  xlab("States/Union Territories") +
  ylab("% of Cured/Confirmed") + 
  ggtitle("Case Cured Rate - % of Cured/Confirmed Cases") +
  scale_fill_viridis_d() + theme_minimal()

p_cured


Key Takeaways


  • Case Fatality Rate: deaths as a percentage of the total number of confirmed cases, computed for every state. A percentage higher than 6% may be cause for concern.
  • Cured rate: cured patients as a percentage of the total number of confirmed cases in a state.
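
As a worked illustration of the two percentages, here is a minimal sketch on invented figures. The data frame `latest` is hypothetical and stands in for the latest per-state row used above; the column names mirror the plotting code.

```r
# Invented latest figures for two hypothetical states
latest <- data.frame(
  state     = c("State A", "State B"),
  Confirmed = c(2000, 500),
  Deaths    = c(40, 5),
  Cured     = c(1900, 450),
  stringsAsFactors = FALSE
)

# Case fatality rate and cured rate, both as a percentage of confirmed cases
latest$deaths_perc <- round(latest$Deaths / latest$Confirmed * 100, digits = 2)
latest$cured_perc  <- round(latest$Cured / latest$Confirmed * 100, digits = 2)
print(latest)
```

Both rates share the same denominator, so they do not sum to 100% while some cases are still active.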

Covid Testing Details

tbl_state_testing_details <- testing
tbl_state_testing_details <- transmute(tbl_state_testing_details, 
                                       Date = Date, 
                                       State = State, 
                                       TotalSamples = replace_na(TotalSamples, 0), 
                                       Negative = replace_na(Negative, 0), 
                                       Positive = replace_na(Positive, 0)
                                      )



p_testing_details <- tbl_state_testing_details %>% filter(TotalSamples != 0) %>% group_by(State) %>% 
  filter(Date == max(Date)) %>%
  ungroup() %>% transmute(
    Date = Date, 
    State = State, 
    Negative = ifelse(Negative == 0, TotalSamples - Positive, Negative), 
    Positive = ifelse(Positive == 0, TotalSamples - Negative, Positive), 
    TotalSamples = Negative + Positive
  ) %>%
  pivot_longer(c(Negative, Positive), names_to = "type", values_to = "Samples") %>% 
  ggplot(mapping = aes(x = reorder(State, desc(TotalSamples)), y = Samples)) + 
  geom_col(mapping = aes(fill = type), position = position_stack(reverse = TRUE), show.legend = TRUE) + 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  coord_flip() +
  ylab("Total Samples Tested") + 
  xlab("State") + 
  ggtitle("Testing Volumes by States") +
  scale_fill_manual(values = c("orange", "red"))
p_testing_details

p_ratio_positive_tests <- tbl_state_testing_details %>% filter(TotalSamples != 0, Positive != 0) %>% group_by(State) %>% 
  filter(Date == max(Date)) %>%
  ungroup() %>% 
  mutate(Positive_test_ratio = round(Positive/TotalSamples, digits = 2), 
        rown = row_number(desc(Positive_test_ratio))) %>%
  filter(rown <= 20) %>% 
  ggplot(mapping = aes(x = reorder(State, desc(-Positive_test_ratio)), y = Positive_test_ratio)) + 
  geom_bar(mapping = aes(fill = State), stat = "identity", show.legend = FALSE) + 
  coord_flip() +
  ylab("Ratio of Positive Samples Tested") + 
  xlab("State") + 
  ggtitle("Test positivity by State") + 
  scale_fill_viridis_d()
p_ratio_positive_tests

Key Takeaways


  • The states that have tested the most include Uttar Pradesh, Maharashtra, Karnataka, Tamil Nadu, Bihar, and Kerala. As a result, some of these states report more affected people. Health centres, however, may not be able to manage as many positive cases as the previous plots suggest.
  • Sikkim, Mizoram, and Chandigarh have had the least testing.
  • Maharashtra, Delhi, Karnataka, and Kerala have a larger percentage of positive tests than the rest of the country, indicating that the number of infected people will continue to rise if the testing rate is maintained or increased.
  • Surprisingly, Kerala has not been testing much, as shown in the first plot, and its ratio of positive cases to total samples is also at the lower end. This may paint too rosy a picture, which could change if the testing rate picks up. The Kerala figures need further investigation to establish how accurate they are.
vaccine_na <- subset(vaccine, !is.na(Total.Doses.Administered)) 
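# Note: "%y" reads only a two-digit year, so "16/01/2021" is parsed as 2020-01-16.
# The later filter on "2020-03-16" relies on this; with "%Y" that filter would need to use 2021.
vaccine_na$Updated.On <- as.Date(vaccine_na$Updated.On,format="%d/%m/%y")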
vaccine_na <- vaccine_na %>% filter(State != 'India')

top_vaccine <- vaccine_na %>% 
  filter(Updated.On  ==max(Updated.On   )) %>%
  select(State,Total.Doses.Administered) %>% 
  arrange(desc(Total.Doses.Administered)) %>% 
  top_n(5)
## Selecting by Total.Doses.Administered
tv <- top_vaccine[['State']]
tv[6] <- 'Kerala' 
vaccine_na <- rename(vaccine_na,Date = Updated.On)
vaccine_na %>%
  filter(State %in% tv) %>% 
  ggplot(aes(x=Date,y=Total.Doses.Administered)) + geom_line(aes(color=State))+
  labs(title="Time Series for Doses Administered")
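
The date format string used when parsing `Updated.On` deserves a closer look, because the raw values carry four-digit years (for example "16/01/2021"). This short sketch, using two sample strings in the same layout, compares the two-digit `%y` specifier with the four-digit `%Y` specifier:

```r
raw <- c("16/01/2021", "17/01/2021")

# %y consumes only two digits of the year ("20") and ignores the rest
two_digit <- as.Date(raw, format = "%d/%m/%y")
print(two_digit)

# %Y reads the full four-digit year
four_digit <- as.Date(raw, format = "%d/%m/%Y")
print(four_digit)
```

With `%y`, every date lands one year early (2020 instead of 2021), which shifts the time axis of any plot built on it and changes which literal dates a filter has to use.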

Vaccination Data

Now let’s look into the age-wise and gender-wise distribution of vaccination across the country.

Vaccination Overall

Now let’s analyze the overall distribution of vaccines.

vaccination_data <- vaccine_na %>% 
  group_by(Date) %>% 
  summarise(Date,tot = sum(Total.Doses.Administered),
            tot_cv=sum(Covaxin..Doses.Administered.),
            tot_cs=sum(CoviShield..Doses.Administered.),
            tot_m=sum(Male..Doses.Administered.),
            tot_f=sum(Female..Doses.Administered.),
            tot_t=sum(Transgender..Doses.Administered.),
            tot_i=sum(Total.Individuals.Vaccinated)) %>%
  summarise(Total_dose=mean(tot),
            Total_covaxi = mean(tot_cv),
            Total_covis =mean(tot_cs),
            Total_Male = mean(tot_m),
            Total_Female = mean(tot_f),
            Total_Transgender = mean(tot_t),
            Total_vaccinated = mean(tot_i))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups`
## argument.
vaccination_data %>% 
  ggplot(aes(x=Date)) + geom_area(aes(y=Total_dose,color='green'),fill='green',alpha=.3) +
  geom_area(aes(y=Total_Male,color='blue'),fill='blue',alpha=.3) +
  geom_area(aes(y=Total_Female,color='red'),fill='red',alpha=.3) +
  geom_area(aes(y=Total_Transgender,color='yellow'),fill='black',alpha=1) +
  labs(title="Time series for Vaccinated")  +
  xlab(label ="Time Period") +
  ylab(label="Total Vaccinated") +
  scale_y_continuous(labels = scales :: number_format(accuracy=1))+
  theme(legend.position="right")+
  scale_color_identity(name = "Legend",
                       breaks = c("green", "blue", "red","yellow"),
                       labels = c("Total Vaccinated", "Men", "Women","Transgender"),
                       guide = "legend")

Vaccination - Age Distribution

vaccine_bar <- vaccine_na %>% 
  filter(Date =="2020-03-16") %>% 
  select(State, X18.44.Years.Individuals.Vaccinated., X45.60.Years.Individuals.Vaccinated., X60..Years.Individuals.Vaccinated.)
vaccine_bar <- vaccine_bar %>% pivot_longer(cols = starts_with("X"), names_to = "Age Group", values_to = "value")
vaccine_bar <- subset(vaccine_bar, !is.na(value))

vaccine_bar %>% 
  filter(State %in% tv) %>% 
  ggplot(aes(x=State,y=value,fill=`Age Group`))+geom_bar(stat='identity',position = 'fill') +
  scale_fill_discrete(name='Age Group',
                      breaks=c('X18.44.Years.Individuals.Vaccinated.', 'X45.60.Years.Individuals.Vaccinated.','X60..Years.Individuals.Vaccinated.'),
                      labels=c('18 to 44','45 to 60','>60')) +
  ylab('Percentage') +
  theme_classic()+
  labs(title='Age group distribution for')

Modelling

As a last stage in the research, let’s look at some statistical modelling techniques to see whether we can forecast or estimate certain variables. Modelling can be applied in several ways here. Time series analysis/forecasting is one prominent method; however, time series forecasting is outside the scope of this study, so we disregard it. Another method is to predict a value from its relationship with other variables. We therefore try to predict the death count from other variables such as confirmed counts and country. For this, we use a linear model (fitted with lm and evaluated with the statisticalModeling package) and CART (Classification and Regression Trees).


  • In all of the examples below, the discrepancy between actual and predicted values is quite large, so we conclude that the simple regression and decision-tree techniques do not identify a usable relationship.
  • It’s possible that by adding more features, we’ll be able to uncover these connections.
  • Neural networks might also be able to find these intricate relationships, but such trials are beyond the scope of this study.


LR - Type 1

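library(statisticalModeling)  # provides evaluate_model() and fmodel()
data_model <- world_data_by_countries

# Hold out 25% of the rows as a test set; the seed makes the split reproducible
smp_size <- floor(0.75 * nrow(data_model))
set.seed(123)
train_ind <- sample(seq_len(nrow(data_model)), size = smp_size)
train <- data_model[train_ind, ]
test <- data_model[-train_ind, ]

# Simple linear regression: deaths explained by confirmed cases only
model1 <- lm(death ~ confirmed, data = train)
result <- evaluate_model(model1, data = test)  # test rows plus a model_output column
fmodel(model1)  # plot the fitted model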

cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 1263026095

LR - Type 2

library(statisticalModeling)  # provides evaluate_model() and fmodel()
data_model <- world_data_by_countries

# Same 75/25 split as before (same seed, so the same rows)
smp_size <- floor(0.75 * nrow(data_model))
set.seed(123)
train_ind <- sample(seq_len(nrow(data_model)), size = smp_size)
train <- data_model[train_ind, ]
test <- data_model[-train_ind, ]

# Linear regression with country added as a categorical predictor
model2 <- lm(death ~ confirmed + Country.Region, data = train)
result <- evaluate_model(model2, data = test)  # test rows plus a model_output column
fmodel(model2)  # plot the fitted model

cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 106289582
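When a categorical predictor such as Country.Region is part of the formula, a random split can leave a country in the test set that never appears in the training set, and predict() on an lm fit then stops with an error about new factor levels. The following is a minimal sketch of one way to guard against that, using small invented data frames (toy_train and toy_test are hypothetical, not the report's data) and simply dropping the unseen rows before predicting.

```r
# Invented data: country "C" appears only in the test rows.
toy_train <- data.frame(
  Country.Region = c("A", "A", "A", "B", "B", "B"),
  confirmed      = c(100, 200, 300, 150, 250, 350),
  death          = c(2, 4, 7, 6, 9, 13)
)
toy_test <- data.frame(
  Country.Region = c("A", "B", "C"),
  confirmed      = c(250, 300, 400),
  death          = c(5, 11, 20)
)

fit <- lm(death ~ confirmed + Country.Region, data = toy_train)

# Keep only test rows whose country was seen during training.
seen      <- toy_test$Country.Region %in% unique(toy_train$Country.Region)
toy_known <- toy_test[seen, ]

# Predicting on the filtered rows works; predicting on toy_test directly would fail.
toy_known$model_output <- predict(fit, newdata = toy_known)
cat("Rows dropped because the country was unseen:", sum(!seen), "\n")
```

Dropping rows changes which observations the error is computed on, so the number of dropped rows is worth reporting alongside the MSE.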

LR - Type 3

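library(statisticalModeling)  # provides evaluate_model() and fmodel()
data_model <- world_data_by_countries

# Same 75/25 split as before (same seed, so the same rows)
smp_size <- floor(0.75 * nrow(data_model))
set.seed(123)
train_ind <- sample(seq_len(nrow(data_model)), size = smp_size)
train <- data_model[train_ind, ]
test <- data_model[-train_ind, ]

# Note: this formula is identical to LR - Type 2, so the fit and the test MSE are the same
model3 <- lm(death ~ confirmed + Country.Region, data = train)
result <- evaluate_model(model3, data = test)  # test rows plus a model_output column
fmodel(model3)  # plot the fitted model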

cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 106289582

CART - 1

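library(rpart)
# Regression tree on confirmed cases only, reusing train/test from the LR sections.
# cp is the complexity parameter: a split is attempted only if it improves the fit by at least this fraction.
rpart_1 <- rpart(death ~ confirmed, data = train, cp = 0.02)
result <- evaluate_model(rpart_1, data = test)
fmodel(rpart_1)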

cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 1422784267

CART - 2

library(rpart)
# Same formula as CART - 1, but a much smaller cp allows a far deeper tree
rpart_2 <- rpart(death ~ confirmed, data = train, cp = 0.000002)
result <- evaluate_model(rpart_2, data = test)
fmodel(rpart_2)

cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 1303754682
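The cp values above are picked by hand. As a hedged alternative, rpart records cross-validated error for every candidate cp in the fitted object's cptable, so a value can be chosen from the data and the tree pruned back with prune(). The sketch below runs on synthetic data (toy is an invented data frame, not the report's dataset) and picks the cp with the lowest cross-validated error; other selection rules, such as the one-standard-error rule, are equally reasonable.

```r
library(rpart)

# Synthetic stand-in data with one clear split; all values are invented.
set.seed(123)
toy <- data.frame(confirmed = runif(300, 0, 1e5))
toy$death <- ifelse(toy$confirmed > 5e4, 1500, 400) + rnorm(300, sd = 100)

# Grow a deliberately large tree with a very small cp.
big_tree <- rpart(death ~ confirmed, data = toy, cp = 1e-6)

# cptable has one row per candidate cp; "xerror" is the cross-validated relative error.
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the large tree back to the selected complexity.
pruned_tree <- prune(big_tree, cp = best_cp)

leaves_before <- sum(big_tree$frame$var == "<leaf>")
leaves_after  <- sum(pruned_tree$frame$var == "<leaf>")
cat("Selected cp:", best_cp, "\n")
cat("Leaves before/after pruning:", leaves_before, "/", leaves_after, "\n")
```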

CART - 3

library(rpart)
# Regression tree with country added as a categorical predictor
rpart_3 <- rpart(death ~ confirmed + Country.Region, data = train, cp = 0.0002)
result <- evaluate_model(rpart_3, data = test)
fmodel(rpart_3)

cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 191932715
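Mean squared errors in the hundreds of millions are hard to read because they are in squared units. Taking the square root gives the RMSE, which is on the same scale as the death counts. The short sketch below collects the six MSE values printed in this section into a data frame (the name reported is just an illustrative choice) and converts them.

```r
# Test-set MSE values exactly as printed by the six models in this section.
reported <- data.frame(
  model = c("LR - Type 1", "LR - Type 2", "LR - Type 3",
            "CART - 1", "CART - 2", "CART - 3"),
  mse   = c(1263026095, 106289582, 106289582,
            1422784267, 1303754682, 191932715)
)

# RMSE is in the same units as the response (number of deaths).
reported$rmse <- round(sqrt(reported$mse))

# Sort from lowest to highest error for easier comparison.
print(reported[order(reported$rmse), ], row.names = FALSE)
```

On this scale the errors range from roughly 10,300 deaths for the linear models that include country to roughly 37,700 deaths for the first regression tree, and in both model families adding Country.Region reduces the error substantially.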

Results and Discussions


That brings us to the end of our investigation. Before we wrap up, let’s briefly review what we did and how it contributes to building a stronger defence against COVID-19. We started with worldwide COVID-19 data. Because the data was not in a format suited to our research, we preprocessed and combined it into the form we needed. We then applied a range of exploratory methods to understand how COVID-19 began, which regions it affected, how its impact changed over time, how it varied across seasons and months, and so on. We recorded our observations in the key takeaways sections next to the relevant analyses. We then used a second dataset to examine the effect of COVID-19 on the population and the situation in our own country. Throughout, we aimed to examine and investigate the effects of the pandemic, as set out in our objectives.

This analysis, like any other, has limitations. Missing data is one of the most significant: we found substantial gaps in the COVID-19 recovery data. We also noticed some inconsistencies within the recorded data. Finally, some countries have been accused of under-reporting, which could lead to incorrect interpretations of the data.